Abstract

In this paper, we study the zero-delay source-channel coding problem, and specifically the problem of obtaining the vector 
transformations that optimally map between the m-dimensional source space and the k-dimensional channel space, under a given
transmission power constraint and for the mean square error distortion. We first study the functional properties of this problem 
and show that the objective is concave in the source and noise densities and convex in the density of the input to the channel. 
We then derive the necessary conditions for optimality of the encoder and decoder mappings. A well known result in information 
theory pertains to the linearity of optimal encoding and decoding mappings in the scalar Gaussian source and channel setting, at 
all channel signal-to-noise ratios (CSNRs). In this paper, we study this result more generally, beyond the Gaussian source and 
channel, and derive the necessary and sufficient condition for linearity of optimal mappings, given a noise (or source) distribution, 
and a specified power constraint. We also prove that the Gaussian source-channel pair is unique in the sense that it is the only 
source-channel pair for which the optimal mappings are linear at more than one CSNR value. Moreover, we show the asymptotic
linearity of optimal mappings for low CSNR if the channel is Gaussian regardless of the source and, at the other extreme, for high 
CSNR if the source is Gaussian, regardless of the channel. Our numerical results show strict improvement over prior methods. The 
numerical approach is extended to the scenario of source-channel coding with decoder side information. The resulting encoding 
mappings are shown to be continuous relatives of, and in fact subsume as a special case, the Wyner-Ziv mappings encountered in
digital distributed source coding systems. 

Index Terms 

Joint source channel coding, analog communications, estimation, distributed coding. 

I. Introduction 

A fascinating result in information theory is that uncoded transmission of Gaussian samples, over a channel with additive 
white Gaussian noise (AWGN), is optimal in the sense that it yields the minimum achievable mean square error (MSE) between 
source and reconstruction [1]. This result demonstrates the potential of joint source-channel coding: Such a simple scheme, at
no delay, provides the performance of the asymptotically optimal separate source and channel coding system, without recourse 
to complex compression and channel coding schemes that require asymptotically long delays. However, it is understood that 
the best source channel coding system at fixed finite delay may not, in general, achieve Shannon's asymptotic coding bound 
(see e.g. [2]).

Clearly, the problem of obtaining the optimal scheme for a given finite delay is an important open problem with considerable 
practical implications. There are two main approaches to the practical problem of transmitting a discrete time continuous 
alphabet source over a discrete time additive noise channel: "analog communication" via direct amplitude modulation, and 
"digital communication" which typically consists of quantization, error control coding and digital modulation. The main 
advantage (and hence proliferation) of digital over analog communication is due to advanced quantization and error control 
techniques, as well as the prevalence of digital processors. However, there are two notable shortcomings: First, error control 
coding (and to some extent also source coding) usually incurs substantial delay to achieve good performance. The other problem 
involves limited robustness of digital systems to varying channel conditions, due to underlying quantization or error protection 
assumptions. The performance saturates due to quantization as the channel signal to noise ratio (CSNR) increases beyond 
the regime for which the system was designed. Also, it is difficult to obtain "graceful degradation" of digital systems with 
decreasing CSNR, when it falls below the minimum requirement of the error correction code in use. Further, such threshold 
effects become more pronounced as the system performance approaches the theoretical optimum. Analog systems offer the 
potential to avoid these problems. As an important example, in applications where significant delay is acceptable, a hybrid 
approach (i.e., vector quantization + analog mapping) was proposed and analyzed [3], [4] to circumvent the impact of CSNR
mismatch where, for simplicity, linear mappings were used and hence no optimality claims were made. Perhaps more importantly,
in many applications delay is a paramount consideration. Analog coding schemes are low complexity alternatives to digital 
methods, providing a "zero-delay" transmission which is suitable for such applications. 

There are no known explicit methods to obtain such analog mappings for a general source and channel, nor is the optimal 
mapping known, in closed form, for other than the trivial case of the scalar Gaussian source-channel pair. Among the few
practical analog coding schemes that have appeared in the literature are those based on the use of space-filling curves for
bandwidth compression, originally proposed more than 50 years ago by Shannon [5] and Kotelnikov [6]. These were then
extended in the work of Fuldseth and Ramstad [7], Chung [8], Vaishampayan and Costa [9], Ramstad [10], and Hekland
et al. [11], where spiral-like curves were explored for transmission of Gaussian sources over AWGN channels for bandwidth
compression (m > k) and expansion (m < k).

Fig. 1. A general block-based point-to-point communication system

There exist two main approaches to numerical optimization of the mappings. One is based on optimizing the parameter set 
of a structured mapping [10]-[13]. The performance of this approach is limited to the parametric form (structure) assumed. For
instance, Archimedean spiral based space-filling mappings, proposed for bandwidth compression, are known to perform well at
high CSNRs. The other is based on power constrained channel optimized vector quantization (PCCOVQ) where a "discretized
version" of the problem is tackled using tools developed for vector quantization [7], [14], [15].

It is also noteworthy that a similar problem was solved in [16], [17], albeit under the stringent constraint that both encoder
and decoder be linear. A related problem, formulated in the pure context of digital systems, was also studied by Fine [18].
Properties of the optimal mappings have been considered, over the years, in [5], [20]. Shannon's arguments [5] are based
on the topological impossibility of mapping between regions in a "one-to-one", continuous manner, unless they have the same
dimensionality. On this basis, he explained the threshold effect common to various communication systems. Moreover, Ziv
[19] showed that for a Gaussian source transmitted over an AWGN channel, no single practical modulation scheme can achieve
optimal performance at all noise levels, if the channel rate is greater than the source rate (i.e., bandwidth expansion). It has
been conjectured that this result holds whenever the source rate differs from the channel rate [20]. Our own preliminary results
appeared in [21], [22].



The existence of optimal real-time encoders has been studied in [23]-[25] for encoding a Markov source with zero delay.
Along these lines, for a similar set of problems, [24] demonstrated the existence of optimal causal encoders using dynamic
programming; its results were recently extended to partially observed Markov sources and multi-terminal settings in [26]. The
problem we consider is intrinsically connected to problems in stochastic control where the controllers must operate at zero
delay. A control problem similar to the zero-delay source-channel coding problem here is Witsenhausen's well known
counterexample [27] (see e.g. [28] for a comprehensive review) where a similar functional optimization problem is studied 
and it is shown that nonlinear controllers can outperform linear ones in decentralized control settings even under Gaussianity 
and MSE assumptions. 

In this paper, we investigate the problem of obtaining vector transformations that optimally map between the m-dimensional 
source space and the k-dimensional channel space, under a given transmission power constraint, and where optimality is in the
sense of minimum mean square reconstruction error. We provide necessary conditions for optimality of the mappings used at 
the encoder and the decoder. It is important to note that virtually any source-channel communication system (including digital 
communication) is a special case of the general mappings shown in Figure 1. A typical digital system, including quantization, 
error correction and modulation, boils down to a specific mapping from the source space $\mathbb{R}^m$ to the channel space $\mathbb{R}^k$ and
back to reconstruction space $\mathbb{R}^m$ at the receiver. Hence the derived optimality conditions are generally valid and subsume
digital communications as an extreme special case. Based on the optimality conditions we derive, we propose an iterative 
algorithm to optimize the mappings for any given m, k (i.e., for both bandwidth expansion or compression) and for any given 
source-channel statistics. To our knowledge, this problem has not been fully solved, except when both source and channel are 
scalar and Gaussian. We provide examples of such m : k mappings for source-channel pairs and construct the corresponding 
source-channel coding systems that outperform the mappings obtained in [7]-[11].

We also study the functional properties of the zero delay source-channel coding problem. Specifically, first we show that the 
end-to-end mean square error is a concave functional of the source density given fixed noise density, and of the noise density 
given a fixed source. Secondly, for the scalar version of the problem, the minimum mean square error is a convex functional 
of the channel input density. The convexity result makes the optimal encoding mapping "essentially unique"$^1$ and paves the
way to the linearity results we present later. 

In digital communications, structured (linear, lattice etc.) codes (mappings) are generally optimal, i.e., the structure comes 
without any loss in performance. We observe that this is not the case at zero-delay. Hence, characterizing the conditions, in 
terms of the source, channel densities and the power constraint, that yield linearity of optimal mappings is an interesting open 
problem. Building on the functional properties we derive in this paper, and the recent results on conditions for linearity of 
optimal estimation [29], we derive the necessary and sufficient conditions for linearity of optimal mappings. We then study
the CSNR asymptotics and particularly show that given a Gaussian source, optimal mappings are asymptotically linear at high 
CSNR, irrespective of the channel. Similarly, for a Gaussian channel, optimal mappings are asymptotically linear at low CSNR 
regardless of the source. 

The last part of the paper extends the numerical approach to the scenario of source-channel coding with decoder side 
information (i.e., the decoder has access to side information that is correlated with the source). This setting, in the context
of pure source coding, goes back to the pioneering work of Slepian and Wolf [30] and Wyner and Ziv [31]. The derivation
of the optimality conditions for the decoder side information setting is a direct extension of the point-to-point case, but the 
distributed nature of this setting results in highly nontrivial mappings. Straightforward numerical optimization of such mappings 
is susceptible to getting trapped in numerous poor local minima that riddle the cost functional. Note, in particular, that in the case
of Gaussian sources and channels, linear encoders and decoders (automatically) satisfy the necessary conditions of optimality
while, as we will see, careful optimization obtains considerably better mappings that are far from linear. 

In Section II, we formulate the problem. In Section III, we study the functional properties of the problem and derive 
the necessary conditions for optimality of the mappings. We then provide an iterative algorithm based on these necessary 
conditions, in Section IV. In Section V, we derive the necessary and sufficient conditions for linearity of encoding and/or 
decoding mappings, in the particular instance of scalar mappings. We provide example mappings and comparative numerical 
results in Section VI. Discussion and future work are presented in Section VII. 



II. Problem Formulation 

A. Preliminaries and Problem Definitions 

Let $\mathbb{R}$, $\mathbb{N}$, $\mathbb{R}^+$, and $\mathbb{C}$ denote the respective sets of real numbers, natural numbers, positive real numbers and complex
numbers. In general, lowercase letters (e.g., $x$) denote scalars, boldface lowercase (e.g., $\mathbf{x}$) vectors, uppercase (e.g., $U$, $X$)
matrices and random variables, and boldface uppercase (e.g., $\mathbf{X}$) random vectors. $E(\cdot)$ and $P(\cdot)$ denote the expectation and
probability operators, respectively. $\|\cdot\|$ denotes the $\ell_2$ norm. $\nabla$ denotes the gradient and $\nabla_x$ denotes the partial gradient with
respect to $x$. $f'(\cdot)$ denotes the first order derivative of the function $f(\cdot)$, i.e., $f'(x) = \frac{df(x)}{dx}$. All the logarithms in the paper
are natural logarithms and may in general be complex. The integrals are in general Lebesgue integrals.

Let $\mathcal{S}_m^k$ denote the set of Borel measurable, square integrable functions $\{f : \mathbb{R}^m \to \mathbb{R}^k\}$. Let us define the set $\mathcal{S}^+$ as the
set of monotonically increasing, deterministic, Borel measurable $\mathbb{R} \to \mathbb{R}$ mappings.

1) Point to point: We consider the general communication system whose block diagram is shown in Figure 1. An m-dimensional
zero mean$^2$ vector source $X \in \mathbb{R}^m$ is mapped into a k-dimensional vector $Y \in \mathbb{R}^k$ by function $g \in \mathcal{S}_m^k$, and
transmitted over an additive noise channel. The received vector $\hat{Y} = Y + Z$ is mapped by the decoder to the estimate $\hat{X}$
via function $h \in \mathcal{S}_k^m$. The zero mean noise $Z$ is assumed to be independent of the source $X$. The m-fold source density is
denoted $f_X(\cdot)$ and the k-fold noise density is $f_Z(\cdot)$ with characteristic functions $F_X(\omega)$ and $F_Z(\omega)$, respectively.

The objective is to minimize, over the choice of encoder $g \in \mathcal{S}_m^k$ and decoder $h \in \mathcal{S}_k^m$, the distortion

$D[g, h] = E\{\|X - \hat{X}\|^2\}$, (1)

subject to the average power constraint,

$P[g] = E\{\|g(X)\|^2\} \le P_T$, (2)

where $P_T$ is the specified transmission power level. Bandwidth compression-expansion is determined by the setting of the
source and channel dimensions, k/m. The power constraint limits the choice of encoder function $g(\cdot)$. Note that without a
power constraint on $g(\cdot)$, the CSNR is unbounded and the channel can be made effectively noise free.

Our goal is to minimize MSE subject to the average power constraint. Let us write MSE explicitly as a functional of $g(\cdot)$
and $h(\cdot)$:

$D[g, h] = \int\int [x - h(g(x) + z)]^T [x - h(g(x) + z)] f_X(x) f_Z(z)\, dx\, dz$. (3)

1 The optimal mapping is not strictly unique, in the sense that multiple trivially "equivalent" mappings can be used to obtain the same channel input density.
For example, a scalar unit variance Gaussian source and scalar Gaussian channel with power constraint $P$ can be optimally encoded by either $y = \sqrt{P}x$ or
$y = -\sqrt{P}x$.

2 The zero mean assumption is not necessary, but it considerably simplifies the notation. Therefore, it is made throughout the paper.



SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY 



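[Figure 2 (block diagram): the source $X_1 \in \mathbb{R}^{m_1}$ enters the encoder $g : \mathbb{R}^{m_1} \to \mathbb{R}^k$, whose output $Y \in \mathbb{R}^k$ is corrupted by the additive noise $Z$; the decoder $h$ then produces the estimate of $X_1$.]

Fig. 2. Source-channel coding with decoder side information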



To impose the power constraint, we construct the Lagrangian cost functional

$J[g, h] = D[g, h] + \lambda P[g]$, (4)

to be minimized over the mappings $g(\cdot)$ and $h(\cdot)$.

2) Decoder side information: As shown in Figure 2, there are two correlated vector sources $X_1 \in \mathbb{R}^{m_1}$ and $X_2 \in \mathbb{R}^{m_2}$ with
a joint density $f_{X_1,X_2}(x_1, x_2)$. $X_2$ is available only to the decoder, while $X_1$ is mapped to $Y \in \mathbb{R}^k$ by the encoding function
$g : \mathbb{R}^{m_1} \to \mathbb{R}^k$ and transmitted over the channel whose additive noise $Z \in \mathbb{R}^k$, with density $f_Z(\cdot)$, is independent of $X_1, X_2$. The
received channel output $\hat{Y} = Y + Z$ is mapped to the estimate $\hat{X}_1$ by the decoding function $h : \mathbb{R}^k \times \mathbb{R}^{m_2} \to \mathbb{R}^{m_1}$. The
problem is to find optimal mapping functions $g, h$ that minimize the distortion

$D[g, h] = E\{\|X_1 - \hat{X}_1\|^2\}$ (5)

subject to an average power constraint identical to (2).



B. Asymptotic Bounds for Gaussian Source and Channel 

Although the problem we consider is delay limited, it is insightful to consider asymptotic bounds obtained at infinite delay. 
From Shannon's source and channel coding theorems, it is known that, asymptotically, the source can be compressed to R(D) 
bits (per source sample) at distortion level D, and that C bits can be transmitted over the channel (per channel use) with 
arbitrarily low probability of error, where R(D) is the source rate-distortion function, and C is the channel capacity, (see e.g. 
0). The asymptotically optimal coding scheme is the tandem combination of the optimal source and channel coding schemes, 
hence $mR(D) \le kC$ must hold. By setting

$R(D) = \frac{k}{m} C$, (6)

one obtains a lower bound on the distortion of any source-channel coding scheme. Next, we specialize to Gaussian sources and 
channels, which we will mostly use in the numerical results section, while emphasizing that the proposed method is generally 
applicable to any source and noise densities. The rate-distortion function for the memoryless Gaussian source of variance $\sigma_X^2$,
under the squared-error distortion measure is given by

$R(D) = \max\left(0, \frac{1}{2}\log\frac{\sigma_X^2}{D}\right)$, (7)

for any distortion value $D > 0$. The capacity of the AWGN channel is given by

$C = \frac{1}{2}\log\left(1 + \frac{P_T}{\sigma_Z^2}\right)$, (8)

where $P_T$ is the transmission power constraint and $\sigma_Z^2$ is the noise variance. Plugging (7) and (8) in (6), we obtain the optimal
performance theoretically attainable (OPTA):

$D_{OPTA} = \frac{\sigma_X^2}{\left(1 + \frac{P_T}{\sigma_Z^2}\right)^{k/m}}$. (9)
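As a numerical illustration of (9), the following minimal sketch evaluates the OPTA distortion and compares it, in the scalar case, with the MSE of uncoded (linear) transmission followed by the linear MMSE decoder. The function names and the example parameter values are hypothetical and are not taken from the paper.

```python
import math


def opta_distortion(sigma_x2, sigma_z2, power, m, k):
    """OPTA distortion of eq. (9): Gaussian source, AWGN channel.

    m is the source dimension and k the channel dimension, so the
    exponent k/m is the number of channel uses per source sample.
    """
    csnr = power / sigma_z2
    return sigma_x2 / (1.0 + csnr) ** (k / m)


def linear_scalar_mse(sigma_x2, sigma_z2, power):
    """MSE of y = sqrt(P / sigma_x2) * x with the linear MMSE decoder (m = k = 1)."""
    return sigma_x2 * sigma_z2 / (power + sigma_z2)


def sdr_db(sigma_x2, distortion):
    """Signal-to-distortion ratio in dB."""
    return 10.0 * math.log10(sigma_x2 / distortion)


if __name__ == "__main__":
    # Matched scalar case: uncoded transmission meets OPTA.
    print(opta_distortion(1.0, 1.0, 1.0, m=1, k=1), linear_scalar_mse(1.0, 1.0, 1.0))
    # 2:1 bandwidth compression at a CSNR of 10 dB (illustrative values).
    d = opta_distortion(1.0, 0.1, 1.0, m=2, k=1)
    print(d, sdr_db(1.0, d))
```

For m = k = 1 the two functions agree, which is the optimality of uncoded transmission recalled in the Introduction.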

For source coding with decoder side information, it has been established for Gaussians and MSE distortion that there is no 
rate loss due to the fact that the side information is unavailable to the encoder [31]. Similar to the derivation above, OPTA
can be obtained for source-channel coding with decoder side information, by equating the conditional rate distortion function
of the source (given the side information) to the channel capacity. The rate distortion function of $X_1$ when $X_2$ serves as side
information and $[X_1, X_2] \sim \mathcal{N}(0, R_X)$, where $R_X = \sigma_X^2 \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}$ with $|\rho| \le 1$, is:

$R(D) = \max\left(0, \frac{1}{2}\log\frac{(1 - \rho^2)\sigma_X^2}{D}\right)$. (10)

Similar to the point-to-point setting, we plug (10) and (8) in (6) to obtain OPTA

$D_{OPTA} = \frac{(1 - \rho^2)\sigma_X^2}{\left(1 + \frac{P_T}{\sigma_Z^2}\right)^{k/m}}$. (11)

Note that OPTA is derived without any delay constraints and may not be achievable by a delay-constrained coding scheme. 
No achievable theoretical bound is known for joint source channel coding with limited delay, although there are recent results 
that tighten the outer bound, see e.g. [32]-[34].



III. Functional Properties of Zero-Delay Source-Channel Coding Problem 

In this section, we study the functional properties of the optimal zero-delay source-channel coding problem. These properties 
are not only important in their own right, but also, as we will show in the following sections, enable the derivation of several 
subsequent results. 

Let us restate the Lagrangian cost as $J(X, Z, g, h)$, which makes explicit its dependence on the source and channel
noise $X \sim f_X(\cdot)$ and $Z \sim f_Z(\cdot)$, besides the deterministic mappings $g(\cdot)$ and $h(\cdot)$, as:

$J(X, Z, g, h) = E\{\|X - h(g(X) + Z)\|^2\} + \lambda E\{\|g(X)\|^2\}$. (12)

The minimum achievable cost is

$J_m(X, Z) \triangleq \inf_{g,h} J(X, Z, g, h)$. (13)

Similarly, conditioned on another random variable $U$, $J_m(X, Z|U)$ denotes the overall cost when $U$ is available to both
encoder and decoder. We define $J_r$ as the value of the overall cost as a function of $g(\cdot)$, when $h(\cdot)$ has already been optimized
for $g(\cdot)$:

$J_r(X, Z, g) \triangleq \inf_h J(X, Z, g, h)$. (14)



A. Concavity of $J_m$ in $f_X(\cdot)$ and $f_Z(\cdot)$

In this section, we show the concavity of the minimum cost $J_m$ in the source density $f_X(\cdot)$ and in the channel noise density
$f_Z(\cdot)$. Similar results were derived for MMSE estimation in the scalar setting in [35], where no encoder is present in the
problem formulation. We start with the following lemma which states the impact of conditioning on the overall cost.

Lemma 1 Conditioning cannot increase the overall cost $J_m$, i.e., $J_m(X, Z) \ge J_m(X, Z|U)$ for any $U$.

Proof: The knowledge of $U$ cannot increase the total cost, since we can always ignore $U$ and use the $g(\cdot), h(\cdot)$ pair that
is optimal for $J_m(X, Z)$. Hence, $J_m(X, Z|U) \le J_m(X, Z)$. ■
Using Lemma 1, we prove the following theorem which states the concavity of the minimum cost $J_m(X, Z)$.

Theorem 1 $J_m$ is concave in $f_X(\cdot)$ and $f_Z(\cdot)$.

Proof: Let $X$ be distributed according to $f_X = p f_{X_1} + (1 - p) f_{X_2}$, where $f_{X_1}$ and $f_{X_2}$ respectively denote the densities
of random variables $X_1$ and $X_2$. Then, $X$ can be expressed in terms of a time sharing random variable $U$ which takes values
in the alphabet $\{1, 2\}$, with $P\{U = 1\} = p$: $X = X_U$. Then, we have

$J_m(X, Z) \ge J_m(X, Z|U)$ (15)
$= p J_m(X_1, Z) + (1 - p) J_m(X_2, Z)$, (16)

which proves the concavity of $J_m(X, Z)$ for fixed $f_Z$. Similar arguments on $Z$ prove that $J_m(X, Z)$ is also concave in $f_Z$
for fixed $f_X$. ■
B. Convexity of $J_r$ in $f_Y(\cdot)$

In this section, we assume that the source and the channel are scalar, i.e., $m = k = 1$, for simplicity, although our results can
be extended to higher matched dimensions, $m = k$, $\forall m, k \in \mathbb{N}$. We show the convexity of $J(X, Z, g, h)$ in the channel input
density $f_Y(\cdot)$ of $Y = g(X)$, when $h(\cdot)$ is optimized for $g(\cdot)$. An important distinction to make is that convexity in $g(\cdot)$ is not
implied. A trivial example to demonstrate non-convexity in $g(\cdot)$ is the scalar Gaussian source and channel setting, where both
$Y = \sqrt{P_T/\sigma_X^2}\,X$ and $Y = -\sqrt{P_T/\sigma_X^2}\,X$ are optimal (when used in conjunction with their respective optimal decoders). This example
also leads to the intuition that the cost functional may be "essentially" convex (i.e., convex up to the sign of $g(\cdot)$) although it
is clearly not convex in the strict sense. It turns out that this intuition is correct: $J_r(X, Z, g)$ is convex in $f_Y(\cdot)$.

Towards showing convexity, we first introduce the idea of probabilistic (random) mappings, similar in spirit to the random
encoders used in the coding theorems [36], [37]. We reformulate the mapping problem by allowing random mappings, i.e., we
relax the mapping from a deterministic function $Y = g(X)$ to a probabilistic transformation, expressed as $f_{Y|X}(x, y)$. Note
that similar relaxations to stochastic settings have been used in the literature, e.g., recently in [38]. We define this "generalized"
mapping problem as: minimize $J_{gen}(X, Y, Z)$ over the conditional density $f_{Y|X}$, where the cost functional $J_{gen}$ is defined as

$J_{gen}(X, Y, Z) \triangleq \inf_h E\{(X - h(Y + Z))^2\} + \lambda E\{Y^2\}$. (17)

We first need to show that this relaxation does not change the solution space. This is done via the following lemma.

Lemma 2 $Y \sim f_Y(\cdot)$ which minimizes (17) is a deterministic function of the input, $Y = g(X)$, i.e., $J_m(X, Z) = \inf_g J_r(X, Z, g) = \inf_{f_{Y|X}} J_{gen}(X, Y, Z)$.



Proof: Let us first define an auxiliary function

$G(X, Y, Z) \triangleq (X - h(Y + Z))^2 + \lambda Y^2$. (18)

Next, we observe that

$\inf_h \inf_{f_{Y|X}} J_{gen}(X, Y, Z) = \inf_h \int\int f_X(x) \inf_{f_{Y|X}} \left( \int G(x, y, z) f_{Y|X}(x, y)\, dy \right) dx\, f_Z(z)\, dz$ (19)
$= \inf_h \int\int f_X(x) \inf_y G(x, y, z)\, dx\, f_Z(z)\, dz$, (20)

where (20) is due to the fact that the minimizing $f_{Y|X}$ simply allocates all probability to the value $y$ which minimizes $G(x, y, z)$.
Hence, for any fixed $h(\cdot)$, the minimizing $f_{Y|X}$ is deterministic. Using the optimal $h(\cdot)$ as the fixed $h(\cdot)$ in (20), we show
that the optimal $Y \sim f_Y(\cdot)$ is a deterministic function: $Y = g(X)$.

■ 

Next, we proceed to show that the generalized mapping problem is convex in $f_Y(\cdot)$. To this aim, we show that $J_{gen}$ can be
written in terms of a known metric in probability theory, the Wasserstein metric [39], and use its functional properties. First, we
present the definition and some important properties of this metric.

The Wasserstein metric is a metric defined on the quadratic Wasserstein space$^3$ $\mathcal{P}_2(\mathbb{R})$, defined for $S, Q \in \mathcal{P}_2(\mathbb{R})$ as

$W_2(S, Q) \triangleq \inf\{\|X - Y\|_2 : X \sim S, Y \sim Q\}$, (21)

where $\|X - Y\|_2 = \sqrt{E\{(X - Y)^2\}}$ and the infimum is over the joint distribution of $X$ and $Y$.

The $W_2$ metric measures the convergence in distribution and second order moments, i.e., $W_2(S_{X_k}, S_X)$ converges to zero if
and only if $X_k$ converges to $X$ in distribution and $E\{X_k^2\}$ converges to $E\{X^2\}$. The following properties of this metric will
be used to derive the subsequent results.

Lemma 3 ([39]) $W_2(S, Q)$ satisfies the following properties:

1) The metric $W_2(S, Q)$ is lower semi-continuous in both $S$ and $Q$.

2) For a given $S$, $W_2(S, Q)$ is convex in $Q$.
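To make the definition in (21) concrete, here is a minimal sketch that estimates $W_2$ between two scalar empirical distributions. It relies on a standard fact that is not stated in the text above: on the real line, the infimum over couplings is attained by the monotone (quantile) coupling, so for two samples of equal size it suffices to pair the sorted values. The function name `empirical_w2` is hypothetical.

```python
import numpy as np


def empirical_w2(x_samples, y_samples):
    """Quadratic Wasserstein distance between two 1-D empirical distributions.

    Assumes equal sample counts; sorting both samples realizes the
    monotone coupling, which is optimal for scalar distributions.
    """
    x = np.sort(np.asarray(x_samples, dtype=float))
    y = np.sort(np.asarray(y_samples, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample counts")
    return float(np.sqrt(np.mean((x - y) ** 2)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(0.0, 1.0, size=200_000)
    b = rng.normal(0.0, 2.0, size=200_000)
    # For zero-mean Gaussians the distance is |sigma_a - sigma_b|, here 1.
    print(empirical_w2(a, b))
```

The value depends only on the two marginals, not on how the samples were originally paired, which is what allows the cost to be written in terms of densities in the argument that follows.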

Next, we present our main result in this section. Hereafter, we limit the space of decoding functions to $\mathcal{S}^+$, i.e., monotone
increasing, without any loss of generality (see e.g., [35]).



Theorem 2 $J_r$ is convex in $f_Y(\cdot)$ and hence the solution to the mapping problem is unique in $f_Y(\cdot)$.

Proof: We will first express $J_m(X, Y)$ as a minimization over $f_Y(\cdot)$. Let us define $V = h(Y + Z)$ for a fixed $h(\cdot)$. Then,
using Lemma 2, $J_m(X, Y)$ can be re-written as

$J_m(X, Y) = \inf_h \inf_{f_{V|X}} E\{(X - V)^2\} + \inf_{f_{Y|X}} \lambda E\{Y^2\}$, (22)

which is

$J_m(X, Y) = \inf_h \inf_{f_V} \{W_2^2(f_X, f_V)\} + \lambda \inf_{f_Y} E\{Y^2\}$. (23)

The first term on the right hand side of (23) is convex in $f_V$ since $W_2^2(f_X, f_V)$ is convex in $f_V$ (due to Lemma 3, property
2) when $f_X$ is fixed, and the pointwise minimizer of a convex function is convex. Since $Y$ and $V$ are related in a one-to-one
manner through $V = h(Y + Z)$ and $h(\cdot) \in \mathcal{S}^+$, this term is convex also in $f_Y$. Since $E\{Y^2\}$ is linear in $f_Y(\cdot)$, we conclude
that $J_m$ is the infimum of a convex functional of $f_Y(\cdot)$, where the infimum is taken over $f_Y(\cdot)$. Hence, the solution is unique
in $f_Y(\cdot)$.

Given that the solution is unique, we can express $J_m$ as

$J_m(X, Z) = \inf_{f_Y} J_r(X, Z, g)$ (24)

a.e. in $X$ and $Z$, where $Y = g(X)$. Hence the functional we are interested in is indeed $J_r(X, Z, g)$, which is convex in $f_Y(\cdot)$.

■

3 The quadratic Wasserstein space on $\mathbb{R}$ is defined as the collection of all Borel probability measures with finite second moments, denoted by $\mathcal{P}_2(\mathbb{R})$.

A practically important consequence of Theorem 2 is stated in the following corollary.

Corollary 1 $J_r$ is convex in $g(\cdot)$ where $g(\cdot) \in \mathcal{S}^+$ (i.e., $g(\cdot)$ is monotone increasing).

Proof: There is a one-to-one mapping between $Y$ and the encoder $g(\cdot)$ as $F_X(X) = F_Y(g(X))$, where $F_X$ and $F_Y$ denote
the cumulative distribution functions of $X$ and $Y$ respectively. It follows from Theorem 2 that for any $f_{Y_1}$ and $f_{Y_2}$ and $1 > \alpha > 0$,

$\alpha J_r(f_{Y_1}) + (1 - \alpha) J_r(f_{Y_2}) \ge J_r(\alpha f_{Y_1} + (1 - \alpha) f_{Y_2})$. (25)

Since $J_r(f_Y)$ is achieved by a unique $g(\cdot) \in \mathcal{S}^+$, this implies that

$\alpha J_r(f_{g_1}) + (1 - \alpha) J_r(f_{g_2}) \ge J_r(\alpha f_{g_1} + (1 - \alpha) f_{g_2})$, (26)

which shows the convexity of $J_r$ in $g(\cdot)$, where $g(\cdot) \in \mathcal{S}^+$. ■

Remark 1 Note that the optimal mappings, i.e., the mappings that achieve the infimum in (17), exist. To see this, we use
the lower semi-continuity property of $W_2(S, Q)$ in both $S$ and $Q$ as given in Lemma 3. The set of $Y$ is compact since
$E\{Y^2\} \le P_T$, hence the infimum in the problem definition is achievable. This result guarantees that any algorithm based on
the necessary conditions of optimality will converge to a globally optimal solution.

C. Optimality Conditions 

We proceed to develop the necessary conditions for optimality of the encoder and decoder subject to the average power 
constraint in the general setting of $m, k \in \mathbb{N}$.

1) Optimal Decoder Given Encoder: Let $g(\cdot)$ be fixed. Then the optimal decoder is the MMSE estimator of $X$ given
$\hat{Y} = y$, i.e.,

$h(y) = E\{X | y\}$. (27)

Plugging the expressions for expectation, we obtain

$h(y) = \int x f_{X|\hat{Y}}(x, y)\, dx$. (28)

Applying Bayes' rule

$f_{X|\hat{Y}}(x, y) = \frac{f_X(x) f_{\hat{Y}|X}(x, y)}{\int f_X(x) f_{\hat{Y}|X}(x, y)\, dx}$ (29)

and noting that $f_{\hat{Y}|X}(x, y) = f_Z[y - g(x)]$, the optimal decoder can be written, in terms of known quantities, as

$h(y) = \frac{\int x f_X(x) f_Z[y - g(x)]\, dx}{\int f_X(x) f_Z[y - g(x)]\, dx}$. (30)
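The decoder expression (30) can be evaluated numerically once the source density and the encoder are sampled on a grid. The following is a minimal sketch for the scalar case that replaces both integrals by Riemann sums on a uniform grid, so the grid spacing cancels in the ratio. The function names, the grid and the Gaussian example are assumptions made for illustration, not part of the paper.

```python
import numpy as np


def gaussian_pdf(t, var):
    """Zero-mean Gaussian density with variance var."""
    return np.exp(-t ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)


def mmse_decoder(y, x_grid, fx, g_vals, noise_pdf):
    """Evaluate the decoder of eq. (30) at the received values y (scalar case).

    x_grid : uniform grid over the source support
    fx     : source density sampled on x_grid
    g_vals : encoder output g(x) sampled on x_grid
    """
    y = np.atleast_1d(np.asarray(y, dtype=float))
    # weights[i, j] = f_X(x_j) * f_Z(y_i - g(x_j))
    weights = fx[None, :] * noise_pdf(y[:, None] - g_vals[None, :])
    numerator = (weights * x_grid[None, :]).sum(axis=1)
    denominator = weights.sum(axis=1)
    return numerator / denominator


if __name__ == "__main__":
    x_grid = np.linspace(-8.0, 8.0, 4001)
    fx = gaussian_pdf(x_grid, 1.0)      # unit-variance Gaussian source
    g_vals = 1.0 * x_grid               # linear encoder with unit power
    y = np.array([-1.0, 0.0, 0.5, 2.0])
    print(mmse_decoder(y, x_grid, fx, g_vals, lambda t: gaussian_pdf(t, 0.5)))
```

In this Gaussian example with a linear encoder the result should be close to the linear estimator $y / 1.5$, which gives a quick sanity check before trying non-Gaussian densities or nonlinear encoders.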
2) Optimal Encoder Given Decoder: Let $h(\cdot)$ be fixed. To obtain necessary conditions we apply the standard method in
variational calculus [40]:

$\frac{\partial}{\partial \epsilon} J[g(x) + \epsilon \eta(x), h] \Big|_{\epsilon = 0} = 0$, (31)

i.e., we perturb the cost functional for all admissible$^4$ variation functions $\eta(x)$. Since the power constraint is accounted for in
the cost function, the variation function $\eta(\cdot)$ need not be restricted to satisfy the power constraint (all measurable functions
$\eta : \mathbb{R}^m \to \mathbb{R}^k$ are admissible). Applying (31), we get

$\int \left\{ \lambda g(x) - \int h'(g(x) + z) [x - h(g(x) + z)] f_Z(z)\, dz \right\} \eta(x) f_X(x)\, dx = 0$, (32)

where $h'(\cdot)$ denotes the Jacobian of the vector valued function $h(\cdot)$. Equality for all admissible variation functions $\eta(\cdot)$
requires the expression in braces to be identically zero (more formally the functional derivative [40] vanishes at an extremum
point of the functional). This gives the necessary condition for optimality as

$\nabla_g J[g, h] = 0$, (33)

where

$\nabla_g J[g, h] = \lambda f_X(x) g(x) - f_X(x) \int h'(g(x) + z) [x - h(g(x) + z)] f_Z(z)\, dz$. (34)

Remark 2 Unlike the decoder, the optimal encoder is not in closed form.
We summarize the results in this section in the following theorem.

Theorem 3 Given source and noise densities, a coding scheme $(g(\cdot), h(\cdot))$ is optimal only if

$g(x) = \frac{1}{\lambda} \int h'(g(x) + z) [x - h(g(x) + z)] f_Z(z)\, dz$, (35)

$h(y) = \frac{\int x f_X(x) f_Z[y - g(x)]\, dx}{\int f_X(x) f_Z[y - g(x)]\, dx}$, (36)

where varying $\lambda$ provides solutions at different levels of power constraint $P_T$. In fact, $\lambda$ is the slope of the distortion-power
curve: $\lambda = -\frac{dD}{dP_T}$. Moreover, if $m = k = 1$, these conditions are also sufficient for optimality.

Proof: The necessary conditions were derived in (28), (31) and (32). The sufficiency in the case of m = k = 1 follows 
from Corollary 1. ■ 
The theorem states the necessary conditions for optimality but they are not sufficient in general, as is demonstrated in 
particular by the following corollary. 

Corollary 2 For Gaussian source and channel, the necessary conditions of Theorem 3 are satisfied by linear mappings
$g(x) = K_e x$ and $h(y) = K_d y$ for some $K_e \in \mathbb{R}^{m \times k}$, $K_d \in \mathbb{R}^{k \times m}$ for any $m, k \in \mathbb{N}$.

Proof: Linear mappings satisfy the first necessary condition, (35), regardless of the source and channel densities. The optimal
decoder is linear in the Gaussian source and channel setting, hence the linear encoder-decoder pair satisfies both of the necessary
conditions of optimality. ■
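The first step of the proof can be spelled out in the scalar case. The following worked derivation is a sketch under the assumptions $m = k = 1$, a linear decoder $h(y) = K_d y$ and zero mean noise; it shows that condition (35) then forces a linear encoder, whatever the source and noise densities are.

```latex
\begin{align*}
h'(y) &= K_d, \\
\int h'(g(x)+z)\,[x - h(g(x)+z)]\, f_Z(z)\, dz
  &= K_d \int [x - K_d\, g(x) - K_d\, z]\, f_Z(z)\, dz \\
  &= K_d\,[x - K_d\, g(x)] \qquad (\text{since } E\{Z\} = 0), \\
\text{so (35) reads}\quad \lambda\, g(x) &= K_d\,[x - K_d\, g(x)]
  \;\Longrightarrow\; g(x) = \frac{K_d}{\lambda + K_d^2}\, x .
\end{align*}
```

The resulting encoder is linear with gain $K_e = K_d / (\lambda + K_d^2)$, and only the zero mean property of the noise was used.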

Although linear mappings satisfy the necessary conditions of optimality for the Gaussian case, they are known to be highly 
suboptimal when dimensions of source and channel do not match, i.e., $m \neq k$, see e.g. [11]. Hence, this corollary illustrates
the existence of poor local optima and the challenges facing algorithms based on these necessary conditions. 

3) Extension to distributed settings: Optimality conditions for the setting of decoder side information can be obtained by 
following similar steps. However, they involve somewhat more complex expressions and are relegated to the appendix. We note, 
in particular, that for these settings a similar result to Corollary 2 holds, i.e., for Gaussian sources and channels linear mappings
satisfy the necessary conditions. Perhaps surprisingly, even in the matched bandwidth case, e.g., scalar source, channel and side 
information, linear mappings are strictly suboptimal. This observation highlights the need for powerful numerical optimization 
tools. 

4 Our admissibility definition does not need to be very restrictive since it is used to derive a necessary condition. Hence, the only condition required for 
the admissible functions is to be (Borel) measurable, that the integrals exist, and that we can change the order of integration and differentiation. 



SUBMITTED TO IEEE TRANSACTIONS ON INFORMATION THEORY 






IV. Algorithm Design 

The basic idea is to iteratively alternate between the imposition of individual necessary conditions for optimality, and thereby
successively decrease the total Lagrangian cost. Iterations are performed until the algorithm reaches a stationary point. Imposing
the optimality condition for the decoder is straightforward, since the decoder can be expressed as a closed-form functional of known
quantities, $g(\cdot)$, $f_X(\cdot)$ and $f_Z(\cdot)$. The encoder optimality condition is not in closed form, and we perform a steepest descent
search in the direction of the functional derivative of the Lagrangian with respect to the encoder mapping $g(\cdot)$. By design, the
Lagrangian cost decreases monotonically as the algorithm proceeds iteratively. The update for the various encoders is stated
generically as

$$ g_{i+1}(x) = g_i(x) - \mu \nabla_g J[g, h], \tag{37} $$

where $i$ is the iteration index, $\nabla_g J[g, h]$ is the directional derivative, and $\mu$ is the step size. At each iteration $i$, the total cost
decreases monotonically, and iterations are continued until convergence. Previously proposed heuristic suboptimal mappings
[8], [11] can be used as initialization for the encoder mapping optimization. Note that there is no guarantee that an iterative
descent algorithm of this type will converge to the globally optimal solution; in general, it converges to a local minimum.
An important observation is that, in the case of Gaussian sources and channels, the linear encoder-decoder pair satisfies the
necessary conditions of optimality, although, as we will illustrate, there are other mappings that perform better. Hence, initial
conditions are of paramount importance in such greedy optimizations. A preliminary low complexity approach to mitigate the
poor local minima problem is to embed in the solution the noisy relaxation method of [41], [42]. We initialize the encoding
mapping with random initial conditions and run the algorithm at very low CSNR (high Lagrangian parameter $\lambda$). Then, we
gradually increase the CSNR (decrease $\lambda$) while tracking the minimum, until we reach the prescribed CSNR (or power for
a given channel noise level). The numerical results of this algorithm are presented in Section VI.
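The alternation described above can be sketched in a few lines for the scalar case. This is a minimal illustration under stated assumptions, not the implementation used for the reported results: the grid sizes, the step size, the Gaussian source and noise, and all function names (`optimal_decoder`, `encoder_gradient`, `lagrangian`, `optimize`, `lambda_sweep`) are hypothetical choices made here. The encoder gradient drops the positive weight $f_X(x)$, which rescales the step pointwise but keeps a descent direction, and `lambda_sweep` is a plain warm-started sweep over $\lambda$ standing in for the noisy relaxation idea, without the added perturbations.

```python
import numpy as np

def gauss_pdf(t, sigma=1.0):
    return np.exp(-0.5 * (t / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Uniform grids (assumed resolution); tails beyond 5 sigma are neglected.
x = np.linspace(-5.0, 5.0, 201)      # source samples
z = np.linspace(-5.0, 5.0, 201)      # noise samples
y = np.linspace(-15.0, 15.0, 601)    # channel output samples
dx, dz = x[1] - x[0], z[1] - z[0]
fx, fz = gauss_pdf(x), gauss_pdf(z)  # unit-variance Gaussian source and noise

def optimal_decoder(g):
    """Decoder condition: conditional mean of X given Y = y, on the y grid."""
    w = fx[None, :] * gauss_pdf(y[:, None] - g[None, :])  # f_X(x) f_Z(y - g(x))
    return (w @ x) / np.maximum(w.sum(axis=1), 1e-300)

def lagrangian(g, h, lam):
    """J = D + lam * P, with both expectations evaluated as Riemann sums."""
    recon = np.interp(g[:, None] + z[None, :], y, h)      # h(g(x) + z)
    dist = ((x[:, None] - recon) ** 2 * fz[None, :]).sum(axis=1) * dz
    return float(((dist + lam * g ** 2) * fx).sum() * dx)

def encoder_gradient(g, h, lam):
    """lam * g(x) - int h'(g(x)+z) [x - h(g(x)+z)] f_Z(z) dz (f_X weight dropped)."""
    u = g[:, None] + z[None, :]
    h_val = np.interp(u, y, h)
    h_der = np.interp(u, y, np.gradient(h, y))            # numerical h'
    return lam * g - (h_der * (x[:, None] - h_val) * fz[None, :]).sum(axis=1) * dz

def optimize(g0, lam, step=0.2, iters=100):
    """Alternate the closed-form decoder update with one descent step on the encoder."""
    g = g0.copy()
    for _ in range(iters):
        h = optimal_decoder(g)
        g = g - step * encoder_gradient(g, h, lam)
    return g, optimal_decoder(g)

def lambda_sweep(g0, lams, **kwargs):
    """Warm-start each run from the previous solution while lam decreases (CSNR increases)."""
    g = g0
    for lam in lams:
        g, h = optimize(g, lam, **kwargs)
    return g, h
```

A call such as `lambda_sweep(0.5 * x, [0.8, 0.4, 0.2, 0.1])` tracks a solution from low to high CSNR. Since the source and noise are both Gaussian in this sketch, a linear initialization stays close to linear, which is consistent with the observation above that linear mappings satisfy the necessary conditions in that case.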

V. On Linearity of Optimal Mappings 

In this section, we address the problem of "linearity" of optimal encoding and decoding mappings. Our approach builds on
[29], where conditions for linearity of optimal estimation are derived, and on the convexity result in Theorem 2 and the necessary
conditions for optimality presented in Theorem 3. In this section, we focus on the scalar setting, m = k = 1, while our results
can be extended to more general settings.

A. Gaussian Source and Channel 

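We briefly revisit the special case in which both $X$ and $Z$ are Gaussian, $X \sim \mathcal{N}(0, \sigma_X^2)$ and $Z \sim \mathcal{N}(0, \sigma_Z^2)$. It is well
known that the optimal mappings are linear, i.e., $g(X) = k_e X$ and $h(Y) = k_d Y$, where $k_e$ and $k_d$ are given by

$$ k_e = \sqrt{\frac{P_T}{\sigma_X^2}}, \qquad k_d = \frac{1}{k_e}\,\frac{P_T}{P_T + \sigma_Z^2}. \tag{38} $$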

B. On Simultaneous Linearity of Optimal Encoder and Decoder 

In this section, we show that optimality requires that either both mappings be linear or that they both be nonlinear, i.e., a linear 
encoder with a nonlinear decoder, or a nonlinear encoder in conjunction with a linear decoder, are both strictly suboptimal. 
We show this in two steps in the following lemmas. 

Lemma 4 The optimal encoder is linear a.e. if the optimal decoder is linear. 




Proof: Let us plug $h(y) = k_d y$ for some $k_d \in \mathbb{R}$ in the necessary condition of optimality (35). Noting that $h'(y) = k_d$
a.e. in $y$, we have

$$ \lambda g(x) = k_d \int (x - k_d g(x) - k_d z)\, f_Z(z)\, dz \tag{39} $$

a.e. in $x$. Evaluating the integral and noting that $E\{Z\} = 0$, we have

$$ \lambda g(x) = k_d (x - k_d g(x)) \tag{40} $$

a.e., and hence $g(x) = \frac{k_d}{\lambda + k_d^2}\, x = k_e x$. ■

Lemma 5 The optimal decoder is linear a.e. if the optimal encoder is linear. 



Proof: Plugging $g(x) = k_e x$ for some $k_e \in \mathbb{R}$ in the necessary condition of optimality (35), we obtain

$$ \lambda k_e x = \int (x - h(k_e x + z))\, h'(k_e x + z)\, f_Z(z)\, dz \tag{41} $$

a.e. in $x$. Since $h(\cdot)$ is a function from $\mathbb{R} \to \mathbb{R}$, the Weierstrass theorem [43] guarantees that there is a sequence of real-valued
polynomials that uniformly converges to it:

$$ h(y) = \lim_{i \to \infty} \sum_{r=0}^{i} a_r(i)\, y^r, \tag{42} $$

where $a_r(i) \in \mathbb{R}$ is the $r$-th coefficient of the $i$-th polynomial. Since Weierstrass convergence is uniform in $y$, we
can interchange the limit and summation, and hence

$$ h(y) = \sum_{r=0}^{\infty} a_r y^r \tag{43} $$

in $y$, where $a_r = \lim_{i \to \infty} a_r(i)$. Plugging (43) in (41), we obtain

$$ \lambda k_e x = \int \Big[ x - \sum_{i=0}^{\infty} a_i (k_e x + z)^i \Big] \Big[ \sum_{i=0}^{\infty} i a_i (k_e x + z)^{i-1} \Big] f_Z(z)\, dz. \tag{44} $$

Interchanging the summation and integration (see footnote 5), we obtain

$$ -\lambda k_e x + x \sum_{i=0}^{\infty} \int i a_i (k_e x + z)^{i-1} f_Z(z)\, dz = \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} \int i a_i a_j (k_e x + z)^{i-1} (k_e x + z)^{j} f_Z(z)\, dz. \tag{45} $$

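Note that the above equation must hold a.e. in $x$, hence the coefficients of $x^r$ must be identical for all $r \in \mathbb{N}$. Opening up the
expressions $(k_e x + z)^{i-1}$ and $(k_e x + z)^{j}$ via binomial expansion, we have the following set of equations

$$ \sum_{i=r}^{\infty} i a_i \binom{i-1}{r-1} k_e^{r-1} E\{Z^{i-r}\} = \sum_{i=1}^{\infty} \sum_{j=0}^{\infty} \sum_{l=0}^{r} i a_i a_j \binom{i-1}{l} \binom{j}{r-l} k_e^{r} E\{Z^{i+j-1-r}\}, \tag{46} $$

with the convention that a binomial coefficient $\binom{n}{m}$ is zero for $m > n$, which must hold for all $r \geq 2$.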

We note that every equation introduces a new variable $a_r$, so each new equation is linearly independent of its predecessors.
Next, we solve these equations recursively, starting from r = 1. At each $r$, we have one unknown ($a_r$) which is related
"linearly" to known constants. Since the number of linearly independent equations is equal to the number of unknowns for
each $r$, there must exist a unique solution. We know that $a_r = 0$ for all $r \geq 2$ is a solution to (46), so it is the only solution.

■ 

Next, we summarize our main result pertaining to the simultaneous linearity of optimal encoder and decoder.

Theorem 4 The optimal mappings are either both linear or they are both nonlinear.

Proof: The proof directly follows from Lemma 4 and Lemma 5. ■

C. Conditions for Linearity of Optimal Mappings 

In this section, we study the condition for linearity of optimal encoder and/or decoder. Towards obtaining our main result, 
we will use the following auxiliary lemma. 



Lemma 6 The linear encoder and decoder in (38) satisfy the first of the necessary conditions of optimality, (35), regardless of
the source and channel densities.

Proof: The proof directly follows from substitution of (38) in (35). ■

The following theorem presents the necessary and sufficient condition for linearity of optimal encoder and decoder mappings.

Theorem 5 For a given power limit $P_T$, noise $Z$ with variance $\sigma_Z^2$ and characteristic function $F_Z(\omega)$, source $X$ with variance
$\sigma_X^2$ and characteristic function $F_X(\omega)$, the optimal encoding and decoding mappings are linear if and only if

$$ F_X(\alpha\omega) = F_Z^{\gamma}(\omega), \tag{47} $$

where $\gamma = \frac{P_T}{\sigma_Z^2}$ and $\alpha = \sqrt{\frac{P_T}{\sigma_X^2}}$.

Proof: Theorem 4 states that the optimal encoder is linear if and only if the optimal decoder is linear. Hence, we will only
focus on the case where encoder and decoder are simultaneously linear. The first necessary condition is satisfied by Lemma
6; hence only the second necessary condition, (36), remains to be verified.

5 Since the polynomials $\sum_{i=0}^{\infty} a_i (k_e x + z)^{i}$ and $\sum_{i=0}^{\infty} i a_i (k_e x + z)^{i-1}$ respectively converge to $h(k_e x + z)$ and $h'(k_e x + z)$ uniformly in $x$ and $z$, and are hence both upper bounded in magnitude, we can use Lebesgue's dominated convergence theorem to interchange the summation and the integration.
Plugging $g(X) = k_e X$ and $h(Y) = k_d Y$ in (36), we have

$$ k_d y = \frac{\int x\, f_X(x)\, f_Z(y - k_e x)\, dx}{\int f_X(x)\, f_Z(y - k_e x)\, dx}. \tag{48} $$

Expanding (48), we obtain

$$ k_d y \int f_X(x)\, f_Z(y - k_e x)\, dx = \int x\, f_X(x)\, f_Z(y - k_e x)\, dx. \tag{49} $$

Taking the Fourier transform of both sides and via the change of variables $u = y - k_e x$, we have

$$ k_d \iint (u + k_e x)\, f_X(x) f_Z(u) \exp(-j\omega(u + k_e x))\, dx\, du = \iint x\, f_X(x) f_Z(u) \exp(-j\omega(u + k_e x))\, dx\, du, \tag{50} $$

and rearranging the terms, we obtain

$$ \frac{1 - k_e k_d}{k_e k_d}\, F_Z(\omega)\, F_X'(k_e \omega) = F_X(k_e \omega)\, F_Z'(\omega), \tag{51} $$

where primes denote derivatives with respect to $\omega$. Noting that

$$ \gamma = \frac{k_e k_d}{1 - k_e k_d}, \tag{52} $$

we have

$$ \frac{F_X'(k_e \omega)}{F_X(k_e \omega)} = \gamma\, \frac{F_Z'(\omega)}{F_Z(\omega)}, \tag{53} $$

which implies

$$ (\log F_X(k_e \omega))' = (\log F_Z^{\gamma}(\omega))'. \tag{54} $$

The solution to this differential equation is

$$ \log F_X(k_e \omega) = \log F_Z^{\gamma}(\omega) + C, \tag{55} $$

where $C$ is a constant. Noting that $F_X(0) = F_Z(0) = 1$, we determine $C = 0$, and hence

$$ F_X(k_e \omega) = F_Z^{\gamma}(\omega). \tag{56} $$



Since the solution is essentially unique, due to Corollary 1, (56) is not only a necessary but also a sufficient condition for
linearity of optimal mappings. ■

D. Implications of the Matching Conditions 

In this section, we explore some special cases obtained by varying the CSNR (i.e., $\gamma$) and utilizing the matching conditions for
linearity of optimal mappings. We start with a simple but perhaps surprising result.

Theorem 6 Given a source and noise of equal variance, identical to the power limit ($\sigma_X^2 = \sigma_Z^2 = P_T$), the optimal mappings
are linear if and only if the noise and source distributions are identical, i.e., $f_X(x) = f_Z(x)$ a.e., in which case the
optimal encoder is $g(X) = X$ and the optimal decoder is $h(Y) = \frac{1}{2} Y$.



Proof: It is straightforward to see from (47) that, at $\gamma = 1$, the characteristic functions must be identical. Since the
characteristic function uniquely determines the distribution [44], $f_X(x) = f_Z(x)$ a.e. ■
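As a quick sanity check of the mappings named in Theorem 6, the worked computation below recovers $g(X) = X$ and $h(Y) = \frac{1}{2}Y$. It is a sketch that assumes zero-mean $X$ and $Z$, a power constraint met with equality, and the standard linear MMSE form of the decoder gain; the resulting distortion value is a direct consequence of these assumptions and is not quoted from the text.

```latex
\begin{align*}
\gamma &= \frac{P_T}{\sigma_Z^2} = 1, \qquad
k_e = \sqrt{\frac{P_T}{\sigma_X^2}} = 1
  \;\Rightarrow\; g(X) = X, \\
k_d &= \frac{k_e \sigma_X^2}{k_e^2 \sigma_X^2 + \sigma_Z^2}
     = \frac{P_T}{2 P_T} = \frac{1}{2}
  \;\Rightarrow\; h(Y) = \tfrac{1}{2}\, Y, \\
D &= E\{(X - \tfrac{1}{2}(X + Z))^2\}
   = \tfrac{1}{4}\sigma_X^2 + \tfrac{1}{4}\sigma_Z^2
   = \tfrac{1}{2} P_T .
\end{align*}
```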

Remark 3 Note that Theorem 6 holds irrespective of the source (and channel) density, which demonstrates the departure from
the well known example of scalar Gaussian source and channel.

Next, we investigate the asymptotic behavior of optimal encoding and decoding functions at low and high CSNR. The results 
of our asymptotic analysis are of practical importance since they justify, under certain conditions, the use of linear mappings 
without recourse to complexity arguments at asymptotically high or low CSNR regimes. 

Theorem 7 In the limit $\gamma \to 0$, the optimal encoding and decoding functions are asymptotically linear if the channel is
Gaussian, regardless of the source. Similarly, as $\gamma \to \infty$, the optimal mappings are asymptotically linear if the source is
Gaussian, regardless of the channel.

Proof: The proof follows from applying the central limit theorem [44] to the matching condition (47). The central limit
theorem states that as $\gamma \to \infty$, for any finite variance noise $Z$, the characteristic function of the matching source, $F_X(\omega) = F_Z^{\gamma}(\omega/k_e)$, converges to the Gaussian characteristic function. Hence, at asymptotically high CSNR, any noise distribution is
matched by the Gaussian source. Similarly, as $\gamma \to 0$ and for any $F_X(k_e \omega)$, $F_X^{1/\gamma}(k_e \omega)$ converges to the Gaussian characteristic
function, and hence the optimal mappings are asymptotically linear if the channel is Gaussian. ■
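The high-CSNR half of this argument can be illustrated numerically. The sketch below assumes noise uniform on $[-1, 1]$ (so $\sigma_Z^2 = 1/3$ and $F_Z(\omega) = \sin(\omega)/\omega$), a unit-variance source, and integer values of $\gamma$ so that the power $F_Z^{\gamma}$ stays real-valued; the function names are hypothetical. It measures how far the matched source characteristic function $F_Z^{\gamma}(\omega/k_e)$ is from the unit-variance Gaussian one.

```python
import numpy as np

def matched_source_cf(omega, gamma):
    """F_Z^gamma(omega / k_e) for Z uniform on [-1, 1] and a unit-variance source."""
    sigma_z2 = 1.0 / 3.0                    # variance of Uniform[-1, 1]
    k_e = np.sqrt(gamma * sigma_z2)         # k_e = sqrt(P_T / sigma_X^2), P_T = gamma * sigma_Z^2
    u = omega / k_e
    f_z = np.sinc(u / np.pi)                # np.sinc(t) = sin(pi t)/(pi t), so this is sin(u)/u
    return f_z ** gamma                     # integer gamma keeps the result real

def gap_to_gaussian(gamma):
    """Largest deviation from exp(-omega^2 / 2) over a fixed frequency window."""
    omega = np.linspace(-6.0, 6.0, 1201)
    gauss = np.exp(-0.5 * omega ** 2)
    return float(np.max(np.abs(matched_source_cf(omega, gamma) - gauss)))

if __name__ == "__main__":
    for gamma in (1, 10, 100):
        print(gamma, gap_to_gaussian(gamma))
```

The gap shrinks as $\gamma$ grows, in line with the statement that at high CSNR any finite variance noise is matched by a source that approaches the Gaussian.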
Let us next consider a setup with given source and noise variables and a power which may be scaled to vary the CSNR,
$\gamma$. Can the optimal mappings be linear at multiple values of $\gamma$? This question is motivated by the practical setting where
$\gamma$ is not known in advance or may vary (e.g., in the design stage of a communication system). It is well known that the
Gaussian source-Gaussian noise pair makes the optimal mappings linear at all $\gamma$ levels. Below, we show that this is the only
source-channel pair for which the optimal mappings are linear at more than one CSNR value.

Theorem 8 Given source and channel variables, let the power $P_T$ be scaled to vary the CSNR, $\gamma$. The optimal encoding and decoding
mappings are linear at two different power levels $P_1$ and $P_2$ if and only if source and noise are both Gaussian.

Remark 4 This theorem also holds for the setting where source or noise variables are scaled to change the CSNR for a given
power $P_T$.

Proof: Let $\gamma_1$ and $\gamma_2$ denote two CSNR levels, and let $g_1(X) = k_{e1} X$ and $g_2(X) = k_{e2} X$ denote the corresponding encoding mappings. Let the
power be scaled by $\alpha^2$ ($\alpha \in \mathbb{R}^{+}$), i.e., $P_2 = \alpha^2 P_1$, which yields

$$ \gamma_2 = \alpha^2 \gamma_1, \qquad k_{e2} = \alpha k_{e1}. \tag{57} $$

Using (47), we have

$$ F_X(k_{e1}\omega) = F_Z^{\gamma_1}(\omega), \qquad F_X(k_{e2}\omega) = F_Z^{\gamma_2}(\omega). \tag{58} $$

Hence,

$$ F_Z^{\gamma_2}(\omega) = F_Z^{\gamma_1}(\alpha\omega). \tag{59} $$

Taking the logarithm on both sides of (59), applying (57) and rearranging terms, we obtain

$$ \alpha^2 = \frac{\log F_Z(\alpha\omega)}{\log F_Z(\omega)}. \tag{60} $$

Note that (60) should be satisfied for both $\alpha$ and $-\alpha$ since they yield the same $\gamma$. Hence, $F_Z(\alpha\omega) = F_Z(-\alpha\omega)$ for all
$\alpha \in \mathbb{R}$, which implies $F_Z(\omega) = F_Z(-\omega)$ a.e. in $\omega$. Using the fact that the characteristic function is conjugate symmetric (i.e.,
$F_Z(-\omega) = F_Z^{*}(\omega)$), we get $F_Z(\omega) \in \mathbb{R}$ a.e. in $\omega$. As $\log F_Z(\omega)$ is a function from $\mathbb{R} \to \mathbb{C}$, the Weierstrass theorem [43]
guarantees that we can uniformly approximate $\log F_Z(\omega)$ arbitrarily closely by a polynomial $\sum_{i=0}^{\infty} k_i \omega^i$, where $k_i \in \mathbb{C}$. Hence,
by (60) we obtain

$$ \alpha^2 = \frac{\sum_{i=0}^{\infty} k_i (\alpha\omega)^i}{\sum_{i=0}^{\infty} k_i \omega^i}, \tag{61} $$

which holds a.e. in $\omega$ only if all coefficients $k_i$ vanish except for $k_2$, i.e., $\log F_Z(\omega) = k_2 \omega^2$, or $\log F_Z(\omega) = 0$ a.e. in $\omega$ (the solution
$\alpha = 1$ is of no interest). The latter is not a characteristic function, and the former is the Gaussian characteristic function,
$F_Z(\omega) = e^{k_2 \omega^2}$, where we use the established fact that $F_Z(\omega) \in \mathbb{R}$. Since a characteristic function determines the distribution
uniquely, the Gaussian source and noise must be the only allowable pair. ■
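The ratio condition at the heart of this proof, $\alpha^2 = \log F_Z(\alpha\omega) / \log F_Z(\omega)$ for all $\omega$, is easy to probe numerically. The sketch below compares a unit-variance Gaussian noise with a unit-variance Laplacian noise; the Laplacian is an illustrative choice made here (its characteristic function is $1/(1 + \omega^2/2)$), and the helper names are hypothetical.

```python
import numpy as np

def log_cf_ratio(log_cf, alpha, omega):
    """Evaluate log F_Z(alpha * omega) / log F_Z(omega) on a frequency grid."""
    return log_cf(alpha * omega) / log_cf(omega)

def gaussian_log_cf(w):
    return -0.5 * w ** 2                 # log of exp(-w^2 / 2)

def laplace_log_cf(w):
    return -np.log1p(0.5 * w ** 2)       # log of 1 / (1 + w^2 / 2)

if __name__ == "__main__":
    omega = np.linspace(0.1, 5.0, 50)    # avoid omega = 0, where the ratio is 0/0
    alpha = 2.0
    print(log_cf_ratio(gaussian_log_cf, alpha, omega))  # constant, equal to alpha^2
    print(log_cf_ratio(laplace_log_cf, alpha, omega))   # varies with omega
```

For the Gaussian the ratio equals $\alpha^2$ at every frequency, while for the Laplacian it drifts with $\omega$, so no single power scaling can satisfy the condition at two CSNR levels.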



E. On the Existence of Matching Source and Channel 

Having derived the necessary and sufficient condition for linearity of the optimal zero-delay encoding
and decoding mappings, we next focus on the question: when can we find a matching source (or noise) for a given
noise (source)? Given a valid characteristic function $F_Z(\omega)$, and for some $\gamma \in \mathbb{R}^{+}$, the function $F_Z^{\gamma}(\omega)$ may or may not be a
valid characteristic function, which determines the existence of a matching source. For example, matching is guaranteed for
integer $\gamma$, and it is also guaranteed for infinitely divisible $Z$. Conditions on $\gamma$ and $F_Z(\omega)$ for $F_Z^{\gamma}(\omega)$ to be a valid characteristic
function were studied in detail in [29], to which we refer for brevity.
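For integer $\gamma$ the matching source can be constructed explicitly: $F_Z^{\gamma}(\omega)$ is the characteristic function of a sum of $\gamma$ independent copies of $Z$, so a matching source is that sum divided by $k_e$. The sketch below illustrates this with noise uniform on $[-1, 1]$ and Monte Carlo sampling; the noise choice, sample size, seed and function names are assumptions made for illustration only.

```python
import numpy as np

def sample_matched_source(gamma, sigma_x, n, rng):
    """Draw n samples of X = (Z_1 + ... + Z_gamma) / k_e with Z_i uniform on [-1, 1]."""
    sigma_z2 = 1.0 / 3.0                      # variance of Uniform[-1, 1]
    p_t = gamma * sigma_z2                    # power implied by gamma = P_T / sigma_Z^2
    k_e = np.sqrt(p_t) / sigma_x              # k_e = sqrt(P_T / sigma_X^2)
    z = rng.uniform(-1.0, 1.0, size=(n, gamma))
    return z.sum(axis=1) / k_e, k_e

def empirical_cf(samples, omega):
    """Real part of the empirical characteristic function (the law is symmetric)."""
    return float(np.mean(np.cos(omega * samples)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    xs, k_e = sample_matched_source(gamma=3, sigma_x=2.0, n=200_000, rng=rng)
    omega = 0.7
    print(xs.var())                                                # close to sigma_x^2 = 4
    print(empirical_cf(xs, k_e * omega), (np.sin(omega) / omega) ** 3)
```

The two printed characteristic function values agree up to sampling error, which is the matching condition $F_X(k_e\omega) = F_Z^{\gamma}(\omega)$ evaluated at a single frequency.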



VI. Numerical Results 

We implement the algorithm described in Section IV by numerically calculating the derived integrals. For that purpose, we
sample the source and noise distributions on a uniform grid. We also impose bounded support ($-5\sigma$ to $+5\sigma$), i.e., we neglect the
tails of infinite support distributions in the examples.

Fig. 3. Encoder mapping for bi-modal GMM source, Gaussian channel, modes at 3 and -3 as in (62).

Fig. 4. Comparative results: PCCOVQ vs. the proposed method.



A. Scalar Mappings (m = 1, k = 1), Gaussian Mixture Source and Gaussian Channel
We consider a Gaussian mixture source with distribution 

and unit variance Gaussian noise. The encoder and decoder mappings for this source-channel setting are given in Figure 3. As
intuitively expected, since the two modes of the Gaussian mixture are well separated, each mode locally behaves as Gaussian.
Hence the curve can be approximated as piece-wise linear and deviates significantly from a truly linear mapping. This illustrates
the importance of nonlinear mappings for general distributions that diverge from the pure Gaussian.



B. A Numerical Comparison with Vector Quantizer Based Design 

In the following, we compare the proposed approach to the power constrained channel optimized vector quantization
(PCCOVQ) based approach, which first discretizes the problem, numerically solves the discrete problem and then linearly
interpolates between the selected points. The main difference between our approach and PCCOVQ based approaches is that
we derive the necessary conditions of optimality in the original, "analog" domain without any discretization. This not only allows
a theoretical analysis of the problem but also enables a different numerical method which iteratively imposes the
optimality conditions of the "original problem". However, PCCOVQ will arguably approximate the solution at asymptotically
high sampling (discretization) resolution.
Fig. 5. This figure shows the optimal encoder at various CSNR values, (a) CSNR = -5.70 dB, (b) CSNR = -1.69 dB, (c) CSNR = 5.69 dB, when $X \sim \mathcal{N}(0, 1)$, $Z$ is distributed uniformly on the interval $[-1, 1]$, and CSNR
is varied by changing the power, P. Observe that the optimal encoder converges to linear as CSNR increases.
Fig. 6. This figure shows the optimal decoder (estimator) at various CSNR values, (a) CSNR = -5.70 dB, (b) CSNR = -1.69 dB, (c) CSNR = 5.69 dB. Observe that the optimal decoder, similar to the optimal encoder
in Figure 5, converges to linear as CSNR increases.



To compare our method to PCCOVQ, we consider our running example of Gaussian mixture source and Gaussian channel.
For both methods, we use 10 sampling points for the encoder mapping. The main difference is due to two facts: i) The
proposed method is based on the necessary condition derived in the "original" analog domain, and discretization is merely
used to perform the ultimate numerical operations. On the other hand, PCCOVQ defines a "discretized" version of the problem
from the outset, with the implicit assumption that the discretized problem, at sufficiently high resolution, approximates well the
original problem. Hence, although both methods eventually optimize and then interpolate a discrete set of points, the proposed
algorithm finds the values of these points while accounting for the fact that they will eventually be (linearly) interpolated.
PCCOVQ does not account for eventual interpolation and merely solves the discrete problem. ii) Since we consider the problem
in its original domain, we naturally use the optimal decoder, namely, conditional expectation. The PCCOVQ method uses the
standard maximum likelihood method for decoding, see e.g. [15].

The numerical comparisons are shown in Figure 4. As expected, the proposed method outperforms PCCOVQ for the entire
range of CSNRs in this resolution constrained setting of 10 samples. We note that the performance difference diminishes at
higher sampling resolution. The purpose of this comparison is to demonstrate the conceptual difference between these two
approaches at finite resolution while acknowledging that the proposed method does not provide gains at asymptotically high
resolution.

C. A Numerical Example for Theorem 7 

Let us consider a numerical example that illustrates the findings in Theorem 7. Consider a setting where $X$ is Gaussian with
unit variance, i.e., $X \sim \mathcal{N}(0, 1)$, and $Z$ is uniformly distributed on the interval $[-1, 1]$, as stated in the caption of Figure 5. We change $\gamma$ (CSNR) by varying the allowed power $P_T$,
and observe how the optimal mappings behave for different $\gamma$. We numerically calculated the optimal mappings by discretizing
the integrals on a uniform grid with a step size $\Delta = 0.01$, i.e., we approximated the integrals
as Riemann sums. Figures 5 and 6 respectively show how the optimal encoder and decoder mappings converge to linear as
CSNR increases. Note that at $\gamma = -5.70$ dB, the optimal mappings are both highly nonlinear, while at $\gamma = 5.69$ dB they practically
converge to linear, as theoretically anticipated from Theorem 7.
Fig. 7. Encoder 2:1 mapping for unit variance Gaussian source and channel, at CSNR=40dB, SNR=19.41dB. The axes show the two dimensional input (x)
and the function value (g(x)) is reflected in the intensity level.

Fig. 8. Comparative results for Gaussian source-channel, 2:1 mapping.



D. (m = 2,k = 1) Gaussian source-channel mapping 

In this section, we present a bandwidth compression example with 2:1 mappings for a Gaussian vector source of size two
(source samples are assumed to be independent and identically distributed with unit variance) and a scalar Gaussian channel, to
demonstrate the effectiveness of our algorithm for differing source and channel dimensions. We compare the proposed mapping
to the asymptotic bound (OPTA) and prior work [45]. We also compare the optimal encoder-decoder pair to the setting where
only the decoder is optimized and the encoder is fixed. In prior work [8], [10], [45], the Archimedean spiral is found to perform
well for Gaussian 2:1 mappings, and is used for encoding and decoding with maximum likelihood criteria. We hence initialize
our algorithm with an Archimedean spiral (for the encoder mapping). For details of the Archimedean spiral and its settings,
see e.g. p3] and references therein. 

The obtained encoder mapping is shown in Figure 7. While the mapping produced by our algorithm resembles a spiral,
it nevertheless differs from the Archimedean spiral, as will also be evident from the performance results. Note further that
the encoding scheme differs from prior work in that we continuously map the source to the channel signal, where the two
dimensional source is mapped to the nearest point on the space filling spiral. The comparative performance results are shown in
Figure 8. The proposed mapping outperforms the Archimedean spiral [45] over the entire range of CSNR values. It is notable
that the "intermediate" option of only optimizing the decoder captures a significant portion of the gains.
Fig. 9. Encoder mappings for Gaussian scalar source, channel and side information at different CSNR and correlation levels: (a) CSNR = 10 dB, $\rho = 0.97$; (b) CSNR = 22 dB, $\rho = 0.97$; (c) CSNR = 10 dB, $\rho = 0.9$; (d) CSNR = 22 dB, $\rho = 0.9$.

Fig. 10. Comparative results for correlation coefficient $\rho = 0.9$, Gaussian scalar source, channel and side information.



E. Source-Channel Coding with Decoder Side Information 

In this section, we demonstrate the use of the proposed algorithm by focusing on the specific scenario of Figure 2. It must
be emphasized that, while the algorithm is general and directly applicable to any choice of source and channel dimensions and
distributions, for conciseness of the results section we will assume that the sources are jointly Gaussian scalars with correlation
coefficient $\rho$, where $|\rho| < 1$, and are identically distributed as described in Section II.B. We also assume that the noise is scalar
and Gaussian.

Figure 9 presents a sample of encoding mappings obtained by varying the correlation coefficient and CSNR. Interestingly,
the analog mapping captures the central characteristic observed in digital Wyner-Ziv mappings, in the sense of many-to-one
mappings, where multiple source intervals are mapped to the same channel interval, which will potentially be resolved by the
decoder given the side information. Within each bin, there is a mapping function which is approximately linear in this case
(scalar Gaussian sources and channel). To see the effect of correlation on the encoding mappings, we note how the mapping
changes as we lower the correlation from $\rho = 0.97$ to $\rho = 0.9$. As intuitively expected, the side information becomes less reliable and
source points that are mapped to the same channel representation grow further apart from each other. Comparative performance
results are shown in Figure 10. The proposed mapping significantly outperforms linear mapping over the entire range of CSNR
values. We note that this characteristic of the encoding mappings was also noted in experiments with the PCCOVQ approach
in [15], and was implemented in [46] for optimizing hybrid (digital+analog) mappings.



VII. Discussion and Future Work 

In this paper, we studied the zero-delay source-channel coding problem. First, we derived the necessary conditions for
optimality of the encoding and decoding mappings for a given source-channel system. Based on the necessary conditions,
we proposed an iterative algorithm which generates locally optimal encoder and decoder mappings. Comparative results and
example mappings are provided and it is shown that the proposed method improves upon the results of prior work. The
algorithm does not guarantee a globally optimal solution. This problem can be largely mitigated by using more powerful
optimization, in particular a deterministic annealing approach [47]. Moreover, we investigated the functional properties of the
zero-delay source-channel coding problem. We specialized to the scalar setting and showed that the problem is concave in the
source and the channel noise densities and convex in the channel input density. Then, using these functional properties and the
necessary conditions of optimality we had derived, we obtained the necessary and sufficient condition for linearity of optimal
mappings. We also studied the implications of this matching condition and particularly showed that the optimal mappings
converge to linear at asymptotically high CSNR for a Gaussian source, irrespective of the channel density and, similarly, for a
Gaussian channel, at asymptotically low CSNR, irrespective of the source.

The numerical algorithm presented in this paper is feasible for relatively low source and channel dimensions (m, k). For
high dimensional vector spaces, the numerical approach should be supported by imposing a tractable structure on the mappings,
to mitigate the problem of dimensionality. A set of preliminary results in this direction recently appeared in [48], where
a linear transformation followed by scalar non-linear mappings was utilized for the decoder side information setting. The
purely linear solution had been investigated in [49], where numerical algorithms are proposed to find the optimal bandwidth
compression transforms in network settings.

The analysis in this paper, specifically conditions for linearity (and generalizations to other structural forms) of optimal
mappings, as well as the numerical approach, can be extended to well known control problems such as the optimal jamming
problem [50] and Witsenhausen's counterexample [27], [28].

An interesting question concerns the existence of structure in the optimal mappings in some fundamental scenarios. For instance,
in [46], a hybrid digital-analog encoding was employed for the problem of zero-delay source-channel coding with decoder side
information, where the source, the side information and the channel noise are all scalar and Gaussian. The reported performance
results are very close to the performance of the optimal unconstrained mappings. In contrast, in [15], a sawtooth-like structure
was assumed and its parameters were optimized, while PCCOVQ was employed to obtain the non-structured mappings,
and a non-negligible performance difference between these approaches was reported. Hence, this fundamental question, on
whether the optimal zero-delay mappings are structured for the scalar Gaussian side information setting, is currently open.
This problem can be approached numerically by employing a powerful non-convex optimization tool, such as deterministic
annealing [47], and this approach is currently under investigation.

Appendix A 

Optimality Conditions for Coding with Decoder Side Information 
Let the encoder $g(\cdot)$ be fixed. Then, the optimal decoder is the MMSE estimator of $X_1$ given $X_2$ and $Y$:

$$ h(y, x_2) = E\{X_1 \mid y, x_2\}. \tag{63} $$

Plugging in the expressions for the expectation, applying Bayes' rule and noting that $f_{Y|X_1}(y \mid x_1) = f_Z[y - g(x_1)]$, the optimal
decoder can be written, in terms of known quantities, as

$$ h(y, x_2) = \frac{\int x_1\, f_{X_1,X_2}(x_1, x_2)\, f_Z[y - g(x_1)]\, dx_1}{\int f_{X_1,X_2}(x_1, x_2)\, f_Z[y - g(x_1)]\, dx_1}. \tag{64} $$
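The side-information decoder, i.e., the conditional mean of $X_1$ given the channel output $y$ and the side information $x_2$, can be evaluated with a Riemann sum over $x_1$. The sketch below assumes unit-variance jointly Gaussian $(X_1, X_2)$ with correlation `rho` and Gaussian noise; the grid, the function name `side_info_decoder` and the sawtooth encoder in the usage example are hypothetical illustrations, not the mappings obtained in the paper.

```python
import numpy as np

def side_info_decoder(y, x2, g, rho, sigma_z=1.0, n=4001, span=8.0):
    """Riemann-sum estimate of E{X1 | y, x2} for a given encoder g."""
    x1 = np.linspace(-span, span, n)
    # Joint Gaussian density of (x1, x2), up to a constant that cancels in the ratio.
    joint = np.exp(-(x1 ** 2 - 2.0 * rho * x1 * x2 + x2 ** 2) / (2.0 * (1.0 - rho ** 2)))
    # Noise density f_Z(y - g(x1)), again up to a constant.
    noise = np.exp(-0.5 * ((y - g(x1)) / sigma_z) ** 2)
    w = joint * noise
    return float((x1 * w).sum() / w.sum())

def sawtooth(t):
    """A many-to-one encoder with period 2 (illustrative only)."""
    return t - 2.0 * np.round(t / 2.0)

if __name__ == "__main__":
    # The same channel output is decoded to different source values
    # depending on the side information.
    print(side_info_decoder(0.1, 0.1, sawtooth, rho=0.97, sigma_z=0.1))
    print(side_info_decoder(0.1, 2.1, sawtooth, rho=0.97, sigma_z=0.1))
```

With the many-to-one sawtooth encoder, the two calls return estimates roughly one period apart, which shows how the side information resolves the ambiguity that the encoder deliberately introduces.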
To derive the necessary condition for optimality of $g(\cdot)$, we consider the distortion functional

$$ D[g, h] = E\{\| X_1 - h(g(X_1) + Z, X_2) \|^2\}, \tag{65} $$

and construct the Lagrangian cost functional:

$$ J[g, h] = D[g, h] + \lambda P[g]. \tag{66} $$

Now, let us assume the decoder $h(\cdot)$ is fixed. To obtain necessary conditions, we apply the standard method in variational
calculus:

$$ \nabla_g J[g, h](x_1, x_2) = 0, \quad \forall x_1, x_2, \tag{67} $$

where

$$ \nabla_g J[g, h](x_1, x_2) = \lambda f_{X_1,X_2}(x_1, x_2)\, g(x_1) - \int h'(g(x_1) + z, x_2)\, [x_1 - h(g(x_1) + z, x_2)]\, f_Z(z)\, f_{X_1,X_2}(x_1, x_2)\, dz. \tag{68} $$